An Introduction to Pattern Matching

Regular Expression Examples

For more practical examples, take a look at the section, Examples Using Regular Expressions.
Perl/PCRE pattern matching reference
For further reference, we recommend Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools (O'Reilly) by Jeffrey E. F. Friedl, edited by Andy Oram.

Why do we need patterns?

We need patterns to describe text that is inexact, such as 'all words starting with s', 'all words starting with t and ending with m', or 'all 9 digit numbers'. We might be able to list all the possible variations, but it would be impractical and annoying.

We use a pattern matching language to describe a pattern. A pattern matching language uses symbols to describe both normal characters (that are matched exactly) and meta characters (that describe special operations, such as alternatives and repeating sections). Patterns are almost always described using what is called a regular expression, so called because the pattern matching language fulfills some mathematical properties that we don't need to worry about. Patterns can become very complicated very quickly, so let's start with some simple examples.

We've all done a search and replace where we don't care about the case of the search string (Case Sensitive is Off). For example, if I search for 'Car', I also want to find 'car', 'CAR' and any other case variations. We are implicitly searching for an upper or lower case 'C', followed by an upper or lower case 'A', followed by an upper or lower case 'R'. This is an example of a simple pattern.

In a pattern matching language, we could express this as follows:

[Cc][Aa][Rr]

Each set of square brackets is a character class, meaning 'find any one of these characters': a 'C' or 'c', then an 'A' or 'a', then an 'R' or 'r'. If we changed this pattern to

[BbCc][Aa][Rr]

then we also could find 'Bar', 'bar' and 'BAR' (and others). Of course, if we just toggled the Case Sensitivity flag then it wouldn't matter if we used

[bc][a][r] or [BC]AR or [bC]Ar or even [bc]ar

Note that the letters outside of the square brackets simply represent themselves - they are not special.
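
The character class patterns above can be tried outside TextPipe as well. The following is a minimal sketch using Python's re module, which is an assumption of convenience: its syntax is close to the Perl/PCRE style described here, but it is not TextPipe's engine. The re.IGNORECASE flag stands in for switching Case Sensitive off.

```python
import re

text = "Car car CAR Bar bar BAR"

# One character class per letter: only the 'car' variations match.
car_only = re.findall(r"[Cc][Aa][Rr]", text)

# Adding B and b to the first class also finds the 'bar' variations.
bar_or_car = re.findall(r"[BbCc][Aa][Rr]", text)

# With case-insensitive matching, the case used in the pattern no longer matters.
ignore_case = re.findall(r"[bc]ar", text, re.IGNORECASE)

print(car_only)
print(bar_or_car)
print(ignore_case)
```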

Here are some more character classes:

[aeiou]	matches any lower case vowel
[^aeiou]	matches any character that is NOT a lower case vowel
[0123456789]	matches any digit
[^0123456789]	matches any character that is NOT a digit
]	NOT special, NOT the end of a character class when on its own
[]abc]	matches a, b, c and ]
[abc\]]	matches a, b, c and ]
[^]abc]	matches everything except a, b, c and ]

If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash. It is not possible to have the literal character "]" as the end character of a range.

[0123456789^]	matches 0 to 9 and ^
[\^0123456789]	matches 0 to 9 and ^

If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash.

[0-9]	matches 0 to 9
[d-m]	matches d to m
[0-9a-zA-Z]	matches 0 to 9, a to z, A to Z
[-0-9]	matches 0 to 9 and -
[0-9-]	matches 0 to 9 and -
[\000-\037]	matches ASCII characters 0 to 31 (non printable characters)

The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class. Ranges operate in ASCII collating sequence. They can also be used for characters specified numerically, for example [\000-\037].
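
As a rough check of the classes listed above, here is a small sketch in Python's re module (used as a stand-in; TextPipe's own engine may differ in details). The sample strings are made up for illustration.

```python
import re

# Plain and negated classes.
vowels = re.findall(r"[aeiou]", "regular")
non_digits = re.findall(r"[^0-9]", "a1b2")

# A closing bracket placed first in the class is a literal member.
with_bracket = re.findall(r"[]abc]", "a]d")

# A hyphen placed first (or last) is a literal member, not a range.
with_hyphen = re.findall(r"[-0-9]", "7-x")

# A range given numerically in octal: ASCII 0 to 31.
control_chars = re.findall(r"[\000-\037]", "a\tb\x1fc")

print(vowels, non_digits, with_bracket, with_hyphen, control_chars)
```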

TextPipe also provides some convenient short cuts for commonly used classes. They may be used either on their own or inside a character class.

.	Any character, including newline (by default)
\d	any decimal digit
\D	any character that is not a decimal digit
\s	any white space character: space, form feed, newline, carriage return, horizontal tab, and vertical tab
\S	any character that is not a white space character
\w	any "word" character. A "word" character is any letter or digit or the underscore character. The definition of letters and digits may vary, for example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
\W	any "non-word" character
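
The shortcuts can be sketched in Python's re module as follows. This is an illustration under an assumption, not TextPipe's behaviour: Python's defaults differ in places. In particular, Python's dot does not match a newline unless the re.DOTALL flag is given, whereas the table above says TextPipe's dot includes newline by default.

```python
import re

digits = re.findall(r"\d", "Room 101")
words = re.findall(r"\w+", "ship_to = A-12")
non_space_runs = re.findall(r"\S+", "a  b\tc\n")

# A shortcut used inside a character class: digits or a literal dot.
numbers = re.findall(r"[\d.]+", "v1.25 and 3")

# Python-specific: dot skips newline by default, DOTALL makes it match.
dot_default = re.findall(r"a.b", "a\nb")
dot_all = re.findall(r"a.b", "a\nb", re.DOTALL)

print(digits, words, non_space_runs, numbers, dot_default, dot_all)
```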

You can also describe 'special' or unprintable characters using the following patterns:

\t	tab (decimal 9, hex 09)
\f	form feed (decimal 12, hex 0C)
\n	new line (decimal 10, hex 0A)
\r	carriage return (decimal 13, hex 0D)
\a	alarm, that is, the BEL character (decimal 7, hex 07)
\e	escape (decimal 27, hex 1B)
\cx	"control-x", where x is any character. The precise effect of "\cx" is as follows: if "x" is a lower case letter, it is converted to upper case; then bit 6 of the character (hex 40) is inverted. Thus "\cz" becomes hex 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
\xhh	character with hex code hh. After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case).
\ddd	character with octal code ddd, or back reference. After "\0", up to two further octal digits are read. In both cases (\x and \0), if there are fewer than two digits, just those that are present are used. Thus the sequence "\0\x\07" specifies two binary zeros followed by a BEL character. Make sure you supply two digits after the initial zero if the character that follows is itself an octal digit.
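
The "\cx" rule above is plain arithmetic, so it can be worked through in a few lines. The sketch below uses Python; control_char is a hypothetical helper written only for this illustration (Python's re module has no "\cx" escape), while the hex and octal escapes shown afterwards are supported by re.

```python
import re

def control_char(x: str) -> int:
    """Return the character code that "\\cx" stands for, per the rule above."""
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Step 2: bit 6 (hex 40) of the character code is inverted.
    return ord(x) ^ 0x40

cz = control_char("z")      # 'Z' is hex 5A, inverted bit gives hex 1A
c_brace = control_char("{") # '{' is hex 7B, inverted bit gives hex 3B
c_semi = control_char(";")  # ';' is hex 3B, inverted bit gives hex 7B
print(hex(cz), hex(c_brace), hex(c_semi))

# Hex and octal escapes: \x41 is 'A', \011 is a tab (decimal 9).
hex_and_octal = re.fullmatch(r"\x41\011", "A\t") is not None
print(hex_and_octal)
```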

Repetition - quantifiers

Often you are looking for a pattern that is repeated. TextPipe provides a number of different methods to specify how many times a pattern is allowed to repeat. Each method is called a quantifier.

The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example:

z{2,4}

matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special character. If the second number is omitted, but the comma is present, there is no upper limit; if the second number and the comma are both omitted, the quantifier specifies an exact number of required matches. Thus

[aeiou]{3,}

matches at least 3 successive vowels, but may match many more, while

\d{8}

matches exactly 8 digits. An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters. For convenience (and historical compatibility) the three most common quantifiers have single-character abbreviations:

*    is equivalent to {0,}
+    is equivalent to {1,}
?    is equivalent to {0,1}
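
The quantifier examples above can be checked with a short sketch in Python's re module (again a stand-in for TextPipe's engine; the test strings are invented for illustration).

```python
import re

# z{2,4}: between two and four z characters.
candidates = ["z", "zz", "zzz", "zzzz", "zzzzz"]
z_matches = [s for s in candidates if re.fullmatch(r"z{2,4}", s)]

# [aeiou]{3,}: at least three successive vowels, with no upper limit.
vowel_runs = re.findall(r"[aeiou]{3,}", "queueing beautiful")

# \d{8}: exactly eight digits.
eight_ok = re.fullmatch(r"\d{8}", "20240131") is not None
seven_ok = re.fullmatch(r"\d{8}", "2024013") is not None

# The single-character abbreviations.
star = re.fullmatch(r"ab*c", "ac") is not None      # * allows zero b's
plus = re.fullmatch(r"ab+c", "ac") is not None      # + needs at least one b
optional = [w for w in ["color", "colour"] if re.fullmatch(r"colou?r", w)]

print(z_matches, vowel_runs, eight_ok, seven_ok, star, plus, optional)
```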

Normally TextPipe will try to repeat matches as many times as possible - this is known as greedy (or maximal) matching. Using the Search and Replace Pattern Options dialog you can change this to non-greedy (or minimal) matching. If a quantifier is followed by a question mark, it inverts the default 'greediness' of the quantifier.
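
The difference between greedy and non-greedy matching can be sketched in Python's re module. Note the assumption: Python is always greedy by default and, as far as this sketch is concerned, offers only the trailing question mark to switch a single quantifier to non-greedy; the dialog option that inverts the default is specific to TextPipe.

```python
import re

subject = "<html><p>Hello there</p></html>"

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.*>", subject).group()

# Non-greedy: .*? grabs as little as possible, so the match stops at the first '>'.
non_greedy = re.search(r"<.*?>", subject).group()

# Non-greedy matching repeated over the text finds each tag separately.
tags = re.findall(r"<.*?>", subject)

print(greedy)
print(non_greedy)
print(tags)
```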

Pattern	Default matching (Pattern Options)	Subject text	Text matched	Result
<.*>	Non-greedy	<html><p>Hello there</p></html>	<html>	Non-greedy, minimal match
<.*>	Greedy	<html><p>Hello there</p></html>	<html><p>Hello there</p></html>	Greedy, maximal match (matches entire text)
<.*?>	Greedy	<html><p>Hello there</p></html>	<html>	Non-greedy, minimal match
<.*?>	Non-greedy	<html><p>Hello there</p></html>	<html><p>Hello there</p></html>	Greedy, maximal match (matches entire text)

Using these quantifiers we can now build the following examples:

a+	a, aa, aaa, aaaa etc
\d+	any decimal integer
\w+	any single word
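\d{2,3} \d{3,4} \d{4}	a 7 or 8 digit phone number with a 2 or 3 digit area code, with the groups separated by single spaces
s.t	matches the real words set, sat, sit (and others), and also non-words such as s9t, because . matches any character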
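
A few of the examples in this list, sketched in Python's re module as an illustration (the phone numbers and sentences are made up).

```python
import re

integers = re.findall(r"\d+", "12 apples and 345 pears")

# Phone number: 2-3 digit area code, then 3-4 digits, then 4 digits.
phone = re.compile(r"\d{2,3} \d{3,4} \d{4}")
phone_results = [
    phone.fullmatch(s) is not None
    for s in ["02 9555 1234", "555 123 4567", "5 123 4567"]
]

# s.t matches any single character between s and t, not only letters.
s_t = re.findall(r"s.t", "sit set s9t")

print(integers, phone_results, s_t)
```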

Alternatives

Vertical bar ('pipe') characters are used to separate alternative patterns. For example, the pattern

gilbert|sullivan

matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a sub pattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the sub pattern.
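
The left-to-right rule for alternatives can be seen in a short sketch using Python's re module, which follows the same rule for these cases (an illustration only, not TextPipe's engine).

```python
import re

both = re.findall(r"gilbert|sullivan", "gilbert and sullivan")

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.search(r"a|ab", "ab").group()

# Inside a sub pattern, "succeeds" includes matching the rest of the pattern:
# 'a' alone cannot be followed by 'c' here, so the engine falls back to 'ab'.
rest_matters = re.fullmatch(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"cat(s|)", "cat") is not None

print(both, first_wins, rest_matters, empty_alt)
```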

Positional matching or anchors

Sometimes you only want a pattern to match when it is found at the start of a file, or the end of a line, or when it's a whole word. Various positional operators force the match to occur only when it's found in a specific position.

^	Almost always used at the start of a pattern, the circumflex forces the match to occur only at the start of a line (i.e. after a \n or ASCII 10) or at the start of the file. Circumflex need not be the first character of the pattern if a number of alternatives are involved, but it should be the first thing in each alternative in which it appears if the pattern is ever to match that branch. Instead of circumflex, consider using the following assertion: (?<=\r|\n|\A). The reason is a little technical: TextPipe's pattern matching engine expects Unix end of line characters (\n or ASCII 10), so circumflex works for Unix or PC files, but fails on classic Macintosh files, whose lines end with \r (ASCII 13) alone, unless you use the assertion above or explicitly convert the file first.
$	Almost always used at the end of a pattern, the dollar character forces the match to occur only at the end of a line (i.e. before a \n or ASCII 10), or at the end of the file. Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class. Instead of dollar, consider using the following assertion: (?=\r|\n|\z). The reason is a little technical: TextPipe's pattern matching engine expects Unix end of line characters (\n or ASCII 10), so dollar works on Unix files, fails on classic Macintosh files, and on PC files includes the preceding \r (ASCII 13) as part of the line, unless you use the assertion above or explicitly convert the file first.
\b	word boundary. A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.
\B	not a word boundary
\A	start of file
\Z	end of file or new line at end of file
\z	end of file. The difference between \Z and \z is that \Z matches before a new line that is the last character of the string as well as at the end of the string, whereas \z matches only at the end.
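
The end-of-line problem described for ^ and $ can be reproduced in Python's re module, with some adaptations that are assumptions of this sketch rather than TextPipe syntax: Python needs the re.MULTILINE flag for line-based ^ and $, requires a fixed-width lookbehind (so the start-of-line assertion is written as \A or a lookbehind on [\r\n]), and uses \Z for "very end of text", which is what the table above calls \z.

```python
import re

mac_text = "one\rtwo\rthree"   # classic Macintosh line ends: \r only
pc_text = "one\r\ntwo\r\n"     # PC line ends: \r\n

# ^ only recognises \n as a line end, so only the first word is found.
caret_words = re.findall(r"^\w+", mac_text, re.MULTILINE)

# Start of text, or just after any \r or \n.
line_starts = re.findall(r"(?:\A|(?<=[\r\n]))\w+", mac_text)

# $ sits before \n, so on PC text the \r separates it from the word.
dollar_words = re.findall(r"\w+$", pc_text, re.MULTILINE)

# Just before any \r or \n, or at the very end of the text.
line_ends = re.findall(r"\w+(?=[\r\n]|\Z)", pc_text)

# \b keeps 'cat' from matching inside longer words.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

print(caret_words, line_starts, dollar_words, line_ends, whole_words)
```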

Sub Patterns and the Replacement String

Sub patterns are delimited by parentheses (round brackets), which can be nested. Marking part of a pattern as a sub pattern does two things:

1. It localizes a set of alternatives. For example, the pattern

cat(aract|erpillar|)

matches one of the words "cat", "cataract", or "caterpillar". Without the parentheses, it would match "cataract", "erpillar" or the empty string.

2. It sets up the sub pattern as a capturing sub pattern. The text matched by a capturing sub pattern can be referred to later (such as in the replacement string) using the macros $0 (for the full matching text), $1 (for the first sub pattern), $2 ... $9, $a ... $z. Opening parentheses are counted from left to right (starting from 1) to obtain the numbers of the capturing sub patterns.

For example, if the string "the red king" is matched against the pattern

the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3.

The fact that plain parentheses fulfill two functions is not always helpful. There are often times when a grouping sub pattern is required without a capturing requirement. If an opening parenthesis is followed by "?:", the sub pattern does not do any capturing, and is not counted when computing the number of any subsequent capturing sub patterns. For example, if the string "the white queen" is matched against the pattern

the ((?:red|white) (king|queen))

the captured sub strings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of captured sub strings is 36 (0-9, a-z), and the maximum number of all sub patterns, both capturing and non-capturing, is 200.
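
The numbering of captured sub strings can be confirmed with a brief sketch in Python's re module, which counts opening parentheses the same way for these patterns (an illustration only; the 36 and 200 limits above are TextPipe's, not Python's).

```python
import re

# Capturing groups are numbered by their opening parenthesis, left to right.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king").groups()

# (?:...) groups without capturing, so later groups move up by one.
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
).groups()

# Parentheses also localize the alternatives: 'erpillar' alone does not match.
words = ["cat", "cataract", "caterpillar", "erpillar"]
cat_words = [w for w in words if re.fullmatch(r"cat(aract|erpillar|)", w)]

print(capturing)
print(non_capturing)
print(cat_words)
```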

Using parentheses appropriately in the search expression lets the program remember found text, to be used in the replacement text. The simplest example of this, which needs no parentheses, is the complete found string, represented by the ‘$0’ macro in the replacement string. For example, if your search Regular Expression was ‘test|trial|experiment’, and your replacement string was ‘<b>$0</b>’, every instance of the word ‘test’ in your document would be replaced by ‘<b>test</b>’, and similarly for ‘trial’ and ‘experiment’ (in an HTML document, this would have the effect of bolding these words). Note that if you want to include an actual ‘$’ in your replacement text, escape it by doubling it, as in ‘$$’.

You can use parentheses to remember specific parts of the found text. For example, the Regular Expression ‘<img src="([^"]+)">’ would find and match image tags in an HTML document (assuming they were formed exactly like this), and would remember the source image file. You can recall this by using ‘$1’ in your replacement string. So if your replacement string was ‘-- Image File: $1 --’, and the HTML file processed contained the string ‘<img src="/images/test.gif">’, that string would be replaced by ‘-- Image File: /images/test.gif --’.

You can remember multiple parts of the found text. For example, the Regular Expression ‘<img src="([^"]+)" alt="([^"]+)">’ would find and match a string such as ‘<img src="/images/test.gif" alt="My Cool Image">’. If the replacement string was ‘$2 ($1)’, then this image would be replaced by ‘My Cool Image (/images/test.gif)’.
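
The three replacements described above can be sketched with re.sub in Python. The replacement syntax is the main assumption to note: Python writes \g<0>, \1 and \2 where TextPipe uses the macros $0, $1 and $2.

```python
import re

# $0 in TextPipe corresponds to \g<0> in Python: the whole matched text.
bolded = re.sub(r"test|trial|experiment", r"<b>\g<0></b>", "A test, then a trial.")

# $1 corresponds to \1: the first captured sub pattern.
image_file = re.sub(
    r'<img src="([^"]+)">',
    r"-- Image File: \1 --",
    '<img src="/images/test.gif">',
)

# Captured parts can be reordered in the replacement.
alt_first = re.sub(
    r'<img src="([^"]+)" alt="([^"]+)">',
    r"\2 (\1)",
    '<img src="/images/test.gif" alt="My Cool Image">',
)

print(bolded)
print(image_file)
print(alt_first)
```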

Performance

The time that a Regular Expression search takes can range from negligible to unbearably slow. Seemingly small changes in a Regular Expression can make a world of difference. For example, the regular expression

‘[^"]*("[^"]+")[^"]*’

finds and matches quoted strings (which it remembers), as well as surrounding text without quotes. This particular Regular Expression executes quickly enough (in seconds or less) if the file being processed actually has quoted strings in it. If, however, the file is of a reasonable size (say, 50 KB) and does not have any quoted strings in it, the search can take an incredible amount of time to complete, possibly minutes. The first piece of the expression, [^"]*, is what causes the problem: at every starting position it can match a long run of text before the search discovers that no double-quote follows. Removing it makes the search on the 50 KB file nearly instantaneous. Why? Because now the search can tell that before it bothers matching anything else, it must at least find a double-quote. When it doesn’t find one, it has finished its search.

So one trick to making fast regular expressions is to form the regular expression in such a way that it can fail as early as possible. Try to put strings that must be matched right at the beginning of the regular expression. Apart from this, experiment and see what works well and what doesn’t.
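
One way to experiment is to time both forms of the expression on text with no quotes in it. The sketch below does this in Python with re and timeit; the helper name seconds and the sample text are invented for illustration, and the timings depend on the engine and machine, so they say nothing definite about TextPipe itself.

```python
import re
import timeit

slow = re.compile(r'[^"]*("[^"]+")[^"]*')  # leading [^"]* can match anywhere
fast = re.compile(r'("[^"]+")')            # must find a double-quote first

# Both forms capture the same quoted strings when quotes are present.
quoted = 'say "hi" and "bye" now'
slow_hits = slow.findall(quoted)
fast_hits = fast.findall(quoted)

# Text without any double quotes: the case described above.
no_quotes = "lorem ipsum " * 200

def seconds(pattern, text, repeat=3):
    """Best-of-N wall time for one search of text with pattern."""
    return min(timeit.repeat(lambda: pattern.search(text), number=1, repeat=repeat))

print(slow_hits, fast_hits)
print(f"slow form: {seconds(slow, no_quotes):.6f} s")
print(f"fast form: {seconds(fast, no_quotes):.6f} s")
```

Increasing the size of no_quotes shows how each form scales as the input grows.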

Things To Note - Escaping Meta characters

If you want to find any of the meta characters on their own, you must escape them with a backslash to prevent them being interpreted. When in doubt, quote all non-alphanumeric characters. The meta characters are:

Meta character	Purpose
\	escape character; removes the special meaning of the following meta character, and introduces shortcuts and assertions such as \w, \d or \b
|	OR; separates alternative patterns
.	match any character
( ... )	grouping operator; builds $1, $2, etc.
[ ... ]	match any character within brackets
+	quantifier: match one or more times
?	quantifier: at most one match
*	quantifier: match zero or more times
{n,m}	quantifier: match between n and m times
^	match at start of line
$	match at end of line
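
The effect of escaping can be sketched in Python's re module. re.escape is Python's helper for quoting a literal string and is not a TextPipe feature; it is shown here only to illustrate the "when in doubt, quote" advice above.

```python
import re

# Unescaped, + is a quantifier: '1+1' means one or more 1s followed by a 1.
unescaped = re.search(r"1+1", "1+1")

# Escaped, + is an ordinary character.
escaped = re.search(r"1\+1", "1+1").group()

# Unescaped, the dot matches any character; escaped, only a literal dot.
dot_any = re.fullmatch(r"3.5", "3x5") is not None
dot_literal = re.fullmatch(r"3\.5", "3x5") is not None

# re.escape quotes every meta character in a literal string.
literal = "3.5*(2+1)"
matches_itself = re.fullmatch(re.escape(literal), literal) is not None

print(unescaped, escaped, dot_any, dot_literal, matches_itself)
```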